fix(qwen-asr): enable timestamp output when forced_aligner is configured #10013
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds support for returning and parsing word/segment timestamps when the Qwen ASR forced aligner is available.
Changes:
- Detects presence of a forced aligner and requests timestamps from model.transcribe.
- Adds parsing for forced-aligner timestamp objects (start_time, end_time, text) in addition to tuple/list timestamps.
results = self.model.transcribe(
    audio=audio_path, language=language, context=context,
    return_time_stamps=has_aligner,
)
if hasattr(ts, 'start_time') and hasattr(ts, 'end_time') and hasattr(ts, 'text'):
    # ForcedAlignItem dataclass (from qwen_asr forced aligner)
    start_ms = int(ts.start_time * 1000) if ts.start_time is not None else 0
    end_ms = int(ts.end_time * 1000) if ts.end_time is not None else 0
    seg_text = ts.text or ""
elif isinstance(ts, (list, tuple)) and len(ts) >= 3:
Two bugs prevented timestamps from working in the qwen-asr backend:
1. transcribe() was called without return_time_stamps=True, so the forced aligner was loaded but never invoked. Now we pass return_time_stamps=True when a forced_aligner is present.
2. The timestamp parsing code expected (list, tuple) items, but the qwen_asr library returns ForcedAlignItem dataclass instances with .text, .start_time, .end_time attributes. Added hasattr() check to handle this correctly, falling back to tuple parsing for backward compatibility.
- Wrap return_time_stamps kwarg in try/except TypeError for safety
- Add defensive float() normalization for timestamp times
- Use str() for text extraction to ensure string type
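For readers following the review, the hardening steps listed above can be sketched roughly as below. This is an illustration, not the code in the PR: FakeAlignItem is a hypothetical stand-in for the library's ForcedAlignItem, the helper names are made up, and the (start, end, text) order of the legacy tuple is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class FakeAlignItem:
    """Hypothetical stand-in exposing the same three attributes as ForcedAlignItem."""
    text: Optional[str]
    start_time: Optional[float]
    end_time: Optional[float]


def parse_timestamp_item(ts: Any) -> Optional[Tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, text), or None for an unknown shape."""
    if hasattr(ts, "start_time") and hasattr(ts, "end_time") and hasattr(ts, "text"):
        start, end, text = ts.start_time, ts.end_time, ts.text
    elif isinstance(ts, (list, tuple)) and len(ts) >= 3:
        # Legacy path; the (start, end, text) order is assumed for this sketch.
        start, end, text = ts[0], ts[1], ts[2]
    else:
        return None
    # Defensive normalization: float() for times, str() for text.
    start_s = float(start) if start is not None else 0.0
    end_s = float(end) if end is not None else 0.0
    return start_s, end_s, (str(text) if text is not None else "")


def transcribe_with_optional_timestamps(model: Any, has_aligner: bool, **kwargs: Any) -> Any:
    """Request timestamps, retrying without the keyword if transcribe() rejects it."""
    try:
        return model.transcribe(return_time_stamps=has_aligner, **kwargs)
    except TypeError:
        # A transcribe() signature without return_time_stamps lands here.
        return model.transcribe(**kwargs)
```

One caveat with this pattern: a bare except TypeError also swallows TypeErrors raised inside transcribe() itself, so the retry can hide an unrelated bug.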
4f283ba to 346c5d2
The Go server reads TranscriptSegment.start/end via time.Duration, which is in nanoseconds. Previously the backend sent milliseconds (* 1000), so each millisecond count was read as nanoseconds and timestamps came out 1,000,000x too small (e.g. 8e-8 instead of 0.08). Convert seconds → nanoseconds (* 1e9) instead. Also applies to the legacy tuple path for consistency.
Additional fix: seconds → nanoseconds

While testing this PR against a real deployment, I discovered a third issue beyond the two bugs described above.

Bug 3: Timestamp unit mismatch between Python backend and Go server

The Go server reads the segment start and end values (int64) and wraps them in time.Duration:

    // core/backend/transcript.go
    segments = append(segments, &schema.TranscriptSegment{
        Start: time.Duration(s.Start),
        End:   time.Duration(s.End),
    })

Go's time.Duration counts nanoseconds, but the backend was sending milliseconds.

Fix: Convert seconds → nanoseconds (* 1e9).

Verified output

After applying all three fixes, the output is:

    {"segments": [{"id": 0, "start": 0.08, "end": 0.24, "text": "今"}, ...]}

Pushed the additional commit to the PR branch.
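To make the unit mismatch concrete, here is a small sketch of the conversion on the Python side. The function names are hypothetical, and using round() rather than plain truncation is a choice made for this sketch, not something taken from the PR.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def seconds_to_duration_ns(seconds):
    """Convert float seconds to integer nanoseconds, the unit of Go's time.Duration."""
    if seconds is None:
        return 0
    return int(round(float(seconds) * NANOSECONDS_PER_SECOND))


def duration_ns_to_seconds(value):
    """What a consumer reports when it treats an integer as a nanosecond duration."""
    return value / NANOSECONDS_PER_SECOND


# Old behaviour: 0.08 s sent as 80 (milliseconds) but read as 80 ns.
old_wire_value = int(0.08 * 1000)
# New behaviour: 0.08 s sent as 80_000_000 ns.
new_wire_value = seconds_to_duration_ns(0.08)
```

Reading old_wire_value back as nanoseconds gives 8e-8 seconds, the value seen before the fix; new_wire_value reads back as 0.08.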
Read request.timestamp_granularities from the gRPC request.
- 'word': return one segment per aligned item (character / word)
- 'segment' (default): merge consecutive items at sentence boundaries
Sentence boundaries are detected via CJK punctuation (。!?;…) and Latin endings (. ! ? ;). This matches the OpenAI Whisper API contract where omitting the parameter defaults to segment-level.
Unicode curly quotes (U+2018/2019) were being interpreted as Python string delimiters, causing SyntaxError. Use explicit unicode escapes.
The forced aligner strips punctuation from its output, so text-based sentence detection doesn't work. Instead, detect segment boundaries by measuring time gaps between consecutive aligned items. Threshold = max(median_gap * 4, 0.3s). This cleanly separates intra-sentence gaps (< 0.24s) from inter-sentence gaps (> 0.3s) across Chinese, English, and other languages.
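A minimal sketch of the gap-based split, using the threshold formula from the commit message. The function name, the (start, end, text) item shape, measuring the gap from one item's end to the next item's start, and the strict ">" comparison are all assumptions of this sketch.

```python
from statistics import median


def split_by_time_gaps(items, min_gap=0.3, factor=4.0):
    """Group (start, end, text) items; a new group starts after a long pause."""
    if not items:
        return []
    # Pause between the end of one item and the start of the next, in seconds.
    gaps = [max(0.0, nxt[0] - cur[1]) for cur, nxt in zip(items, items[1:])]
    # Threshold = max(median_gap * 4, 0.3s); a single item has no gaps at all.
    threshold = max(median(gaps) * factor, min_gap) if gaps else min_gap
    groups = [[items[0]]]
    for gap, item in zip(gaps, items[1:]):
        if gap > threshold:
            groups.append([item])
        else:
            groups[-1].append(item)
    return groups


aligned = [
    (0.00, 0.20, "a"), (0.25, 0.40, "b"), (0.45, 0.60, "c"),
    (1.20, 1.40, "d"), (1.45, 1.60, "e"),
]
grouped_texts = [[text for _, _, text in group] for group in split_by_time_gaps(aligned)]
```

In this example the median pause is about 0.05s, so the 0.3s floor sets the threshold and only the 0.6s pause before "d" opens a new segment.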
The forced aligner strips whitespace from tokenized text, so English words like ['hello', 'world'] were joined as 'helloworld'. Add _smart_join() that inserts spaces between non-CJK tokens while keeping CJK characters and punctuation unspaced. Works for Chinese, English, Korean, Japanese, and mixed-language text.
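The joining rule could look something like the sketch below. It is not the PR's _smart_join: the name smart_join_sketch, the code point ranges, and the rule that Hangul tokens are spaced like Latin words are assumptions made for illustration.

```python
import unicodedata

# Ranges treated as "unspaced" in this sketch (an assumption, not the PR's list).
_UNSPACED_RANGES = (
    (0x3000, 0x303F),  # CJK symbols and punctuation
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and fullwidth forms
)


def _is_unspaced(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _UNSPACED_RANGES)


def _is_punct(ch):
    # Unicode general categories starting with "P" are punctuation.
    return unicodedata.category(ch).startswith("P")


def smart_join_sketch(tokens):
    """Join tokens, inserting a space only between two non-CJK tokens."""
    out = ""
    for tok in tokens:
        if not tok:
            continue
        if out:
            prev_ch, next_ch = out[-1], tok[0]
            if not (_is_unspaced(prev_ch) or _is_unspaced(next_ch) or _is_punct(next_ch)):
                out += " "
        out += tok
    return out
```

With these rules, ['hello', 'world'] joins as 'hello world', while ['今', '天'] stays '今天'. Punctuation attaches to the token before it, which is wrong for opening brackets and quotes, so a real implementation would need a finer rule there.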
Problem
The qwen-asr backend loads the forced_aligner model correctly but never actually produces timestamps. All segments return start=0, end=0.

Two bugs cause this:

Bug 1: return_time_stamps not passed to transcribe()
Qwen3ASRModel.transcribe() defaults to return_time_stamps=False. The backend never passes True, so the forced aligner is loaded but silently skipped during inference.

Bug 2: Timestamp item format mismatch
The parsing code checks isinstance(ts, (list, tuple)), but qwen_asr returns ForcedAlignItem dataclass instances with .text, .start_time, .end_time attributes, not tuples. The check always fails, so timestamps are zeroed out even if Bug 1 were fixed.

Fix
- Pass return_time_stamps=True to transcribe() when a forced_aligner is loaded.
- Add a hasattr() check for the ForcedAlignItem dataclass before falling back to tuple parsing.

Testing
Verified against qwen3-asr-0.6b with Qwen/Qwen3-ForcedAligner-0.6B: timestamps now return correctly in verbose_json, srt, and vtt formats.